The Rich and the Simple: On the Implicit Bias of Adam and SGD

Neural Information Processing Systems

Adam is the de facto optimization algorithm for several deep learning applications, but an understanding of its implicit bias and how it differs from other algorithms, particularly standard first-order methods such as (stochastic) gradient descent (GD), remains limited. In practice, neural networks (NNs) trained with SGD are known to exhibit simplicity bias -- a tendency to find simple solutions. In contrast, we show that Adam is more resistant to such simplicity bias. First, we investigate the differences in the implicit biases of Adam and GD when training two-layer ReLU NNs on a binary classification task with Gaussian data. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary with a suboptimal margin, whereas Adam leads to much richer and more diverse features, producing a nonlinear boundary that is closer to the Bayes optimal predictor. This richer decision boundary also allows Adam to achieve higher test accuracy both in-distribution and under certain distribution shifts. We theoretically prove these results by analyzing the population gradients. Next, to corroborate our theoretical findings, we present extensive empirical results showing that this property of Adam leads to superior generalization across various datasets with spurious correlations, where NNs trained with SGD are known to show simplicity bias and do not generalize well under certain distributional shifts.
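For readers who want the two update rules side by side, here is a minimal sketch of one plain GD step and one Adam step on the same gradient. It uses the standard textbook Adam update with its usual default constants; the function names, the toy gradient, and the learning rate are illustrative assumptions, and the snippet does not reproduce the paper's two-layer ReLU setting or its results. It only shows the well-known mechanical difference that Adam's per-coordinate normalization makes the first step nearly independent of gradient scale.

```python
import math

def gd_step(w, g, lr):
    """One plain gradient descent step: w <- w - lr * g."""
    return [wi - lr * gi for wi, gi in zip(w, g)]

def adam_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step with bias correction (t starts at 1)."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# Toy gradient (an assumption for illustration): coordinates differ by 1000x.
grad = [1.0, 1e-3]
w0 = [0.0, 0.0]

gd_w = gd_step(w0, grad, lr=0.1)
adam_w, _, _ = adam_step(w0, grad, m=[0.0, 0.0], v=[0.0, 0.0], t=1, lr=0.1)

print("GD:  ", gd_w)    # step sizes proportional to the gradient entries
print("Adam:", adam_w)  # both coordinates move by roughly lr
```

On the first step the bias-corrected moments reduce to the gradient and its square, so Adam moves each coordinate by about `lr * sign(g)`, while GD moves the small-gradient coordinate a thousand times less.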





Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models

arXiv.org Artificial Intelligence

Fine-tuning large language models (LLMs) for complex reasoning with reinforcement learning (RL) continues to be prohibitively expensive. Through a phenomenological investigation of GRPO post-training dynamics, we identify a scaling law characterized by exponential reward saturation. The emergence of this early plateau motivates an important question: can GRPO be equipped with principled early stopping criteria to significantly reduce post-training compute while preserving downstream performance? Across four open-source models--Llama 3B/8B and Qwen 3B/7B--we perform a systematic empirical study of GRPO fine-tuning and derive scaling laws that accurately predict reward trajectories during training. Our analysis shows that GRPO reward curves are well-approximated by an exponential saturation with three phases that are consistent across all models: (i) slow initial progress, (ii) rapid improvement, and (iii) saturation. We further show that a simple parametric scaling law, conditioned on model size, initial performance, and normalized training progress, reliably predicts the onset of plateauing performance. A key practical finding is that training beyond roughly 80% of a single epoch yields negligible reward gains while consuming a substantial fraction of total computation. Using our scaling law, practitioners can forecast these phase transitions early and select data-driven stopping points, substantially reducing GRPO compute without sacrificing final performance. Our results suggest that such predictive scaling laws are a promising tool for managing GRPO fine-tuning costs.
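The abstract does not state the functional form of the scaling law, so the following is only a sketch under an assumed form: a plain exponential saturation R(t) = R_max - (R_max - R_0) exp(-k t) over normalized training progress t. This simplified curve covers the rapid-improvement and saturation phases only, not the slow initial phase. The parameter values and the function names `saturating_reward` and `stopping_point` are hypothetical, chosen to show how a stopping point could be read off such a curve once it has been fitted.

```python
import math

def saturating_reward(t, r0, r_max, k):
    """Assumed reward curve: exponential approach from r0 to r_max."""
    return r_max - (r_max - r0) * math.exp(-k * t)

def stopping_point(k, tol=0.05):
    """Progress t at which only a fraction `tol` of the initial gap remains.

    Solves exp(-k * t) = tol for t.
    """
    return math.log(1.0 / tol) / k

# Hypothetical fitted parameters (not taken from the paper).
r0, r_max, k = 0.2, 0.7, 4.0

t_stop = stopping_point(k, tol=0.05)
remaining = (r_max - saturating_reward(t_stop, r0, r_max, k)) / (r_max - r0)

print(f"stop at normalized progress t = {t_stop:.3f}")
print(f"fraction of reward gap still unrealized = {remaining:.3f}")
```

Under this assumed form the stopping point depends only on the rate k and the tolerance, which is what makes an early forecast possible once k is estimated from the first part of training.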


An operator splitting analysis of Wasserstein--Fisher--Rao gradient flows

arXiv.org Machine Learning

Wasserstein-Fisher-Rao (WFR) gradient flows have been recently proposed as a powerful sampling tool that combines the advantages of pure Wasserstein (W) and pure Fisher-Rao (FR) gradient flows. Existing algorithmic developments implicitly make use of operator splitting techniques to numerically approximate the WFR partial differential equation, whereby the W flow is evaluated over a given step size and then the FR flow (or vice versa). This work investigates the impact of the order in which the W and FR operators are evaluated and aims to provide a quantitative analysis. Somewhat surprisingly, we show that with a judicious choice of step size and operator ordering, the split scheme can converge to the target distribution faster than the exact WFR flow (in terms of model time). We obtain variational formulae describing the evolution over one time step of both sequential splitting schemes and investigate in which settings the W-FR split should be preferred to the FR-W split. As a step towards this goal, we show that the WFR gradient flow preserves log-concavity and obtain the first sharp decay bound for WFR.
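To make the idea of sequential operator splitting concrete, here is a minimal sketch on a toy scalar ODE, dx/dt = -a x + b, rather than on the WFR equation itself. The right-hand side is split into a decay part and a drift part, each with a closed-form flow, and the two sub-flows are composed in either order (Lie splitting). The ODE, the constants, and all function names are assumptions chosen for illustration; the point is only that the two orderings give different answers at a finite step size and both approach the exact solution as the step shrinks.

```python
import math

def flow_decay(x, h, a):
    """Exact flow of dx/dt = -a x over a step of size h."""
    return x * math.exp(-a * h)

def flow_drift(x, h, b):
    """Exact flow of dx/dt = b over a step of size h."""
    return x + b * h

def exact(x0, t, a, b):
    """Exact solution of the unsplit ODE dx/dt = -a x + b."""
    return b / a + (x0 - b / a) * math.exp(-a * t)

def lie_split(x0, t, n, a, b, decay_first=True):
    """Compose the two sub-flows n times, in the chosen order."""
    h = t / n
    x = x0
    for _ in range(n):
        if decay_first:
            x = flow_drift(flow_decay(x, h, a), h, b)
        else:
            x = flow_decay(flow_drift(x, h, b), h, a)
    return x

a, b, x0, t = 2.0, 1.0, 1.0, 1.0
target = exact(x0, t, a, b)
for n in (10, 1000):
    e1 = abs(lie_split(x0, t, n, a, b, decay_first=True) - target)
    e2 = abs(lie_split(x0, t, n, a, b, decay_first=False) - target)
    print(f"n={n:5d}  decay-then-drift error={e1:.5f}  drift-then-decay error={e2:.5f}")
```

The sub-flows do not commute, so the ordering changes the result at any fixed step size, which is the kind of effect the abstract analyzes for the W and FR operators.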




Divergence Phase Index: A Riesz-Transform Framework for Multidimensional Phase Difference Analysis

arXiv.org Machine Learning

We introduce the Divergence Phase Index (DPI), a novel framework for quantifying phase differences in one- and multidimensional signals, grounded in harmonic analysis via the Riesz transform. Building on classical Hilbert transform phase measures, the DPI extends these principles to higher dimensions, offering a geometry-aware metric that is invariant to intensity scaling and sensitive to structural changes. We apply this method to both synthetic and real-world datasets, including intracranial EEG (iEEG) recordings during epileptic seizures, high-resolution microscopy images, and paintings. In the 1D case, the DPI robustly detects hypersynchronization associated with generalized epilepsy, while in 2D, it reveals subtle, imperceptible changes in images and artworks. Additionally, it can detect rotational variations in highly isotropic microscopy images. The DPI's robustness to amplitude variations and its adaptability across domains enable its use in diverse applications, from nonlinear dynamics and complex systems analysis to multidimensional signal processing.
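As background for the 1D case, the sketch below shows the classical Hilbert-transform phase measure that the abstract says the DPI builds on. It is not the DPI itself and does not involve the Riesz transform. It constructs the analytic signal with a small pure-Python DFT, then takes the circular mean of the instantaneous phase difference between two signals. The test signals, their length, and the function names are assumptions made for illustration.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n)
                for m in range(n))
        out.append(s / n if inverse else s)
    return out

def analytic_signal(x):
    """Analytic signal of a real, even-length sequence via the DFT."""
    n = len(x)
    spec = dft(x)
    gain = [0.0] * n          # zero out negative frequencies
    gain[0] = 1.0             # keep DC
    gain[n // 2] = 1.0        # keep Nyquist
    for k in range(1, n // 2):
        gain[k] = 2.0         # double positive frequencies
    return dft([s * g for s, g in zip(spec, gain)], inverse=True)

def mean_phase_difference(x, y):
    """Circular mean of phase(x) - phase(y) from the analytic signals."""
    zx, zy = analytic_signal(x), analytic_signal(y)
    acc = sum(p * q.conjugate() / (abs(p) * abs(q)) for p, q in zip(zx, zy))
    return cmath.phase(acc)

n, cycles, lag = 64, 4, math.pi / 3
x = [math.cos(2 * math.pi * cycles * m / n) for m in range(n)]
# y lags x by `lag` radians and has five times the amplitude.
y = [5.0 * math.cos(2 * math.pi * cycles * m / n - lag) for m in range(n)]

print(mean_phase_difference(x, y), "vs expected", lag)
```

Because each sample is normalized by its magnitude before averaging, rescaling either signal leaves the estimate unchanged, which mirrors the amplitude invariance the abstract highlights.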


Training Variation of Physically-Informed Deep Learning Models

arXiv.org Artificial Intelligence

A successful deep learning network is highly dependent not only on the training dataset, but also on the training algorithm used to condition the network for a given task. The loss function, dataset, and tuning of hyperparameters all play an essential role in training a network, yet there is little discussion of the reliability or reproducibility of a training algorithm. With the rise in popularity of physics-informed loss functions, this raises the question of how reliable one's loss function is in conditioning a network to enforce a particular boundary condition. Reporting the model variation is needed to assess a loss function's ability to consistently train a network to obey a given boundary condition, and provides a fairer comparison among different methods. In this work, a Pix2Pix network predicting the stress fields of high elastic contrast composites is used as a case study. Several different loss functions enforcing stress equilibrium are implemented, with each displaying different levels of variation in convergence, accuracy, and enforcement of stress equilibrium across many training sessions. Suggested practices in reporting model variation are also shared.
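One simple way to report variation across training sessions is to summarize a final metric over repeated runs of each loss function. The sketch below does this with the Python standard library. It is a generic illustration, not the reporting practice proposed in the paper; the loss names `loss_a` and `loss_b` and all metric values are placeholder numbers invented for the example, not results from the study.

```python
import statistics

def summarize_runs(values):
    """Summary statistics for one metric over repeated training runs."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

# Placeholder final errors from four hypothetical training runs per loss.
runs = {
    "loss_a": [0.10, 0.12, 0.11, 0.13],
    "loss_b": [0.09, 0.20, 0.05, 0.14],
}

summary = {name: summarize_runs(vals) for name, vals in runs.items()}
for name, s in summary.items():
    print(f"{name}: mean={s['mean']:.3f} std={s['std']:.3f} "
          f"range=[{s['min']:.2f}, {s['max']:.2f}] over n={s['n']} runs")
```

In this made-up example the two losses have similar means but very different spreads, which is the kind of difference a single-run comparison would hide.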